Abstract


Recurrent and convolutional neural net- 
works comprise two distinct families of 
models that have proven to be useful 
for encoding natural language utterances. 
In this paper we present SoPa, a new 
model that aims to bridge these two ap- 
proaches. SoPa combines neural represen- 
tation learning with weighted finite-state 
automata (WFSAs) to learn a soft version 
of traditional surface patterns. We show 
that SoPa is an extension of a one-layer 
CNN, and that such CNNs are equivalent 
to a restricted version of SoPa, and accordingly,
to a restricted form of WFSA. Empirically, on
three text classification tasks, SoPa is comparable
to or better than both
a BiLSTM (RNN) baseline and a CNN 
baseline, and is particularly useful in small 
data settings. 


1 Introduction 


Recurrent neural networks (RNNs; Elman, 1990) 
and convolutional neural networks (CNNs; Le- 
Cun, 1998) are two of the most useful text repre- 
sentation learners in NLP (Goldberg, 2016). These 
methods are generally considered to be quite dif- 
ferent: the former encodes an arbitrarily long se- 
quence of text, and is highly expressive (Siegel- 
mann and Sontag, 1995). The latter is more local, 
encoding fixed length windows, and accordingly 
less expressive. In this paper, we seek to bridge the 
gap between RNNs and CNNs, presenting SoPa 
(for Soft Patterns), a model that lies in between 
them. 

SoPa is a neural version of a weighted finite-state automaton (WFSA), with a restricted set of transitions. Linguistically, SoPa is appealing as it is able to capture a soft notion of surface patterns (e.g., “what a great X !”; Hearst, 1992), where some words may be dropped, inserted, or replaced with similar words (see Figure 1). From a modeling perspective, SoPa is interesting because WFSAs are well-studied and come with efficient and flexible inference algorithms (Mohri, 1997; Eisner, 2002) that SoPa can take advantage of.

Figure 1: A representation of a surface pattern as a six-state automaton. Self-loops allow for repeatedly inserting words (e.g., “funny”). ε-transitions allow for dropping words (e.g., “a”).

*The first two authors contributed equally.

SoPa defines a set of soft patterns of different 
lengths, with each pattern represented as a WFSA 
(Section 3). While the number and lengths of the 
patterns are hyperparameters, the patterns them- 
selves are learned end-to-end. SoPa then repre- 
sents a document with a vector that is the aggre- 
gate of the scores computed by matching each of 
the patterns with each span in the document. Be- 
cause SoPa defines a hidden state that depends on 
the input token and the previous state, it can be 
thought of as a simple type of RNN. 

We show that SoPa is an extension of a one- 
layer CNN (Section 4). Accordingly, one-layer 
CNNs can be viewed as a collection of linear- 
chain WFSAs, each of which can only match 
fixed-length spans, while our extension allows 
matches of flexible-length. As a simple type of 
RNN that is more expressive than a CNN, SoPa 
helps to link CNNs and RNNs. 


To test the utility of SoPa, we experiment with 
three text classification tasks (Section 5). We 
compare against four baselines, including both 
a bidirectional LSTM and a CNN. Our model 
performs on par with or better than all base- 
lines on all tasks (Section 6). Moreover, when 
training with smaller datasets, SoPa is particu- 
larly useful, outperforming all models by sub- 
stantial margins. Finally, building on the con- 
nections discovered in this paper, we offer a 
new, simple method to interpret SoPa (Section 7). 
This method applies equally well to CNNs. We 
release our code at https://github.com/Noahs-ARK/soft_patterns.


2 Background 


Surface patterns. Patterns (Hearst, 1992) are a
particularly useful tool in NLP (Lin et al., 2003;
Etzioni et al., 2005; Schwartz et al., 2015). The 
most basic definition of a pattern is a sequence 
of words and wildcards (e.g., “X is a Y”), which 
can either be manually defined or extracted from a 
corpus using cooccurrence statistics. Patterns can 
then be matched against a specific text span by re- 
placing wildcards with concrete words. 

Davidov et al. (2010) introduced a flexible no- 
tion of patterns, which supports partial matching 
of the pattern with a given text by skipping some 
of the words in the pattern, or introducing new 
words. In their framework, when a sequence of 
text partially matches a pattern, hard-coded partial 
scores are assigned to the pattern match. Here, we 
represent patterns as WFSAs with neural weights, 
and support these partial matches in a soft manner. 


WFSAs. We review weighted finite-state automata with ε-transitions before we move on to our special case in Section 3. A WFSA-ε with $d$ states over a vocabulary $V$ is formally defined as a tuple $F = \langle \pi, T, \eta \rangle$, where $\pi \in \mathbb{R}^d$ is an initial weight vector, $T : (V \cup \{\epsilon\}) \to \mathbb{R}^{d \times d}$ is a transition weight function, and $\eta \in \mathbb{R}^d$ is a final weight vector. Given a sequence of words in the vocabulary $x = \langle x_1, \ldots, x_n \rangle$, the Forward algorithm (Baum and Petrie, 1966) scores $x$ with respect to $F$. Without ε-transitions, Forward can be written as a series of matrix multiplications:


$$p_{\text{span}}(x) = \pi^\top \left( \prod_{i=1}^{n} T(x_i) \right) \eta \qquad (1)$$


ε-transitions are followed without consuming a word, so Equation 1 must be updated to reflect the possibility of following any number (zero or more) of ε-transitions in between consuming each word:


$$p_{\text{span}}(x) = \pi^\top T(\epsilon)^* \left( \prod_{i=1}^{n} T(x_i)\, T(\epsilon)^* \right) \eta \qquad (2)$$


where $*$ is matrix asteration: $A^* := \sum_{j=0}^{\infty} A^j$. In our experiments we use a first-order approximation, $A^* \approx I + A$, which corresponds to allowing zero or one ε-transition at a time. When the FSA $F$ is probabilistic, the result of the Forward algorithm can be interpreted as the marginal probability of all paths through $F$ while consuming $x$ (hence the symbol “p”).

The Forward algorithm can be generalized to 
any semiring (Eisner, 2002), a fact that we make 
use of in our experiments and analysis.¹ The vanilla version of Forward uses the sum-product semiring: $\oplus$ is addition, $\otimes$ is multiplication. A special case of Forward is the Viterbi algorithm (Viterbi, 1967), which sets $\oplus$ to the max operator. Viterbi finds the highest scoring path through $F$ while consuming $x$. Both Forward and Viterbi have runtime $O(d^3 + d^2 n)$, requiring just a single linear pass through the phrase. Using first-order approximate asteration, this runtime drops to $O(d^2 n)$.²
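The two scoring rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`forward_score`, `viterbi_score`, `maxmul_vec`) and the toy three-state automaton are assumptions made for this example, all weights are assumed non-negative, and asteration uses the first-order approximation described in the text.

```python
import numpy as np

def forward_score(pi, eta, token_mats, eps_mat=None):
    """Sum-product Forward score of a whole span (Equations 1 and 2).

    pi, eta    : initial / final weight vectors, shape (d,)
    token_mats : list of (d, d) matrices T(x_1), ..., T(x_n)
    eps_mat    : optional (d, d) matrix T(eps); A* is approximated by I + A
    """
    d = pi.shape[0]
    star = np.eye(d) if eps_mat is None else np.eye(d) + eps_mat
    h = pi @ star
    for T in token_mats:
        h = h @ T @ star
    return float(h @ eta)

def maxmul_vec(h, A):
    """Max-product of row vector h with matrix A: out[j] = max_k h[k] * A[k, j]."""
    return np.max(h[:, None] * A, axis=0)

def viterbi_score(pi, eta, token_mats, eps_mat=None):
    """Same recurrence with the semiring 'plus' replaced by max (non-negative weights)."""
    d = pi.shape[0]
    star = np.eye(d) if eps_mat is None else np.maximum(np.eye(d), eps_mat)
    h = maxmul_vec(pi, star)
    for T in token_mats:
        h = maxmul_vec(maxmul_vec(h, T), star)
    return float(np.max(h * eta))

# Hypothetical 3-state chain: self-loops 0.1, main-path weights 0.5 and 0.4.
pi = np.array([1.0, 0.0, 0.0])
eta = np.array([0.0, 0.0, 1.0])
T = np.array([[0.1, 0.5, 0.0],
              [0.0, 0.1, 0.4],
              [0.0, 0.0, 0.1]])
total = forward_score(pi, eta, [T, T, T])   # sums over all three accepting paths
best = viterbi_score(pi, eta, [T, T, T])    # keeps only the best single path
```

In the toy example every accepting path over three tokens uses both main-path transitions and exactly one self-loop, so Forward adds three equal path weights while Viterbi keeps one of them.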

Finally, we note that Forward scores are for ex- 
act matches—the entire phrase must be consumed. 
We show in Section 3.2 how phrase-level scores 
can be summarized into a document-level score. 


3 SoPa: A Weighted Finite-State 
Automaton RNN 


We introduce SoPa, a WFSA-based RNN, which 
is designed to represent text as a collection of surface
pattern occurrences. We start by showing how
a single pattern can be represented as a WFSA-ε
(Section 3.1). Then we describe how to score a 
complete document using a pattern (Section 3.2), 
and how multiple patterns can be used to encode 
a document (Section 3.3). Finally, we show that 
SoPa can be seen as a simple variant of an RNN 
(Section 3.4). 


¹The semiring parsing view (Goodman, 1999) has produced unexpected connections in the past (Eisner, 2016). We experiment with max-product and max-sum semirings, but note that our model could be easily updated to use any semiring.

²In our case, we also use a sparse transition matrix (Section 3.1), which further reduces our runtime to $O(dn)$.


3.1 Patterns as WFSAs 


We describe how a pattern can be represented as a WFSA-ε. We first assume a single pattern. A pattern is a WFSA-ε, but we impose hard constraints on its shape, and its transition weights are given by differentiable functions that have the power to capture concrete words, wildcards, and everything in between. Our model is designed to behave similarly to flexible hard patterns (see Section 2), but to be learnable directly and “end-to-end” through backpropagation. Importantly, it will still be interpretable as a simple, almost linear-chain, WFSA-ε.

Each pattern has a sequence of $d$ states (in our experiments we use patterns of varying lengths between 2 and 7). Each state $i$ has exactly three possible outgoing transitions: a self-loop, which allows the pattern to consume a word without moving states, a main path transition to state $i + 1$ which allows the pattern to consume one token and move forward one state, and an ε-transition to state $i + 1$, which allows the pattern to move forward one state without consuming a token. All other transitions are given score 0. When processing a sequence of text with a pattern $p$, we start with a special START state, and only move forward (or stay put), until we reach the special END state.³ A pattern with $d$ states will tend to match token spans of length $d - 1$ (but possibly shorter spans due to ε-transitions, or longer spans due to self-loops). See Figure 1 for an illustration.

Our transition function, $T$, is a parameterized function that returns a $d \times d$ matrix. For a word $x$:

$$[T(x)]_{i,j} = \begin{cases} E(u_i \cdot v_x + a_i), & \text{if } j = i \text{ (self-loop)} \\ E(w_i \cdot v_x + b_i), & \text{if } j = i + 1 \\ 0, & \text{otherwise,} \end{cases} \qquad (3)$$

where $u_i$ and $w_i$ are vectors of parameters, $a_i$ and $b_i$ are scalar parameters, $v_x$ is a fixed pre-trained word vector for $x$,⁴ and $E$ is an encoding function, typically the identity function or sigmoid. ε-transitions are also parameterized, but don’t consume a token and depend only on the current state:

$$[T(\epsilon)]_{i,j} = \begin{cases} E(c_i), & \text{if } j = i + 1 \\ 0, & \text{otherwise,} \end{cases} \qquad (4)$$

where $c_i$ is a scalar parameter.⁵ As we have only three non-zero diagonals in total, the matrix multiplications in Equation 2 can be implemented using vector operations, and the overall runtimes of Forward and Viterbi are reduced to $O(dn)$.⁶

³To ensure that we start in the START state and end in the END state, we fix $\pi = [1, 0, \ldots, 0]$ and $\eta = [0, \ldots, 0, 1]$.

⁴We use GloVe 300d 840B (Pennington et al., 2014).

⁵Adding ε-transitions to WFSAs does not increase their


Words vs. wildcards. Traditional hard patterns 
distinguish between words and wildcards. Our 
model does not explicitly capture the notion of either, but the transition weight function can be interpreted in those terms. Each transition is a logistic regression over the next word vector $v_x$. For example, for a main path out of state $i$, $T$ has two parameters, $w_i$ and $b_i$. If $w_i$ has large magnitude and is close to the word vector for some word $y$ (e.g., $w_i \approx 100 v_y$), and $b_i$ is a large negative bias (e.g., $b_i \approx -100$), then the transition is essentially matching the specific word $y$. Whereas if $w_i$ has small magnitude ($w_i \approx 0$) and $b_i$ is a large positive bias (e.g., $b_i \approx 100$), then the transition is ignoring the current token and matching a wildcard.⁷ The transition could also be something in
between, for instance by focusing on specific di- 
mensions of a word’s meaning encoded in the vec- 
tor, such as POS or semantic features like animacy 
or concreteness (Rubinstein et al., 2015; Tsvetkov 
et al., 2015). 
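To make the word-versus-wildcard reading concrete, the sketch below builds the sparse transition matrices of Equations 3 and 4 and plugs in the extreme parameter settings mentioned above. It is an illustration under stated assumptions rather than the released code: `transition_matrix`, `epsilon_matrix`, the two-dimensional unit "word vectors" `v_y` and `v_z`, and all parameter values are hypothetical, and the last state's main-path and ε parameters are simply left unused.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def transition_matrix(v_x, U, a, W, b, encode=sigmoid):
    """T(x) of Equation 3 for one pattern with d states.

    U, W : (d, e) self-loop / main-path parameter vectors u_i, w_i
    a, b : (d,)   self-loop / main-path biases a_i, b_i
    """
    d = U.shape[0]
    T = np.zeros((d, d))
    i = np.arange(d)
    T[i, i] = encode(U @ v_x + a)                          # j = i (self-loop)
    T[i[:-1], i[:-1] + 1] = encode(W[:-1] @ v_x + b[:-1])  # j = i + 1 (main path)
    return T

def epsilon_matrix(c, encode=sigmoid):
    """T(eps) of Equation 4: only the j = i + 1 diagonal is non-zero."""
    d = c.shape[0]
    T = np.zeros((d, d))
    i = np.arange(d - 1)
    T[i, i + 1] = encode(c[:-1])
    return T

# Toy setting (assumed): two orthogonal unit vectors stand in for word embeddings.
v_y, v_z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
d = 3
U, a = np.zeros((d, 2)), np.full(d, -100.0)   # self-loops effectively switched off

# "Concrete word" transition: w_i = 100 * v_y, b_i = -100.
W_word, b_word = np.tile(100.0 * v_y, (d, 1)), np.full(d, -100.0)
score_y = transition_matrix(v_y, U, a, W_word, b_word)[0, 1]   # moderate score for y
score_z = transition_matrix(v_z, U, a, W_word, b_word)[0, 1]   # vanishing score for z

# "Wildcard" transition: w_i = 0, b_i = 100 gives a score near 1 for any word.
W_wild, b_wild = np.zeros((d, 2)), np.full(d, 100.0)
score_any = transition_matrix(v_z, U, a, W_wild, b_wild)[0, 1]
```

With unit-norm vectors the "concrete word" setting scores the target word at 0.5 and everything orthogonal to it at essentially zero, so the match is sharp in relative terms even though the absolute score is not close to 1.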


3.2 Scoring Documents 


So far we described how to calculate how well a 
pattern matches a token span exactly (consuming 
the whole span). To score a complete document, 
we prefer a score that aggregates over all matches 
on subspans of the document (similar to “search” 
instead of “match” in regular expression parlance). 
We still assume a single pattern. 

Either the Forward algorithm can be used to calculate the expected count of the pattern in the document, $\sum_{1 \le i \le j \le n} p_{\text{span}}(x_{i:j})$, or Viterbi to calculate $s_{\text{doc}}(x) = \max_{1 \le i \le j \le n} s_{\text{span}}(x_{i:j})$, the score of the highest-scoring match. In short documents, we expect patterns to typically occur at most once, so in our experiments we choose the Viterbi algorithm, i.e., the max-product semiring.


expressive power, and in fact slightly complicates the Forward equations. We use them as they require fewer parameters, and make the modeling connection between (hard) flexible patterns and our (soft) patterns more direct and intuitive.

⁶Our implementation is optimized to run on GPUs, so the observed runtime is even sublinear in $d$.

⁷A large bias increases the eagerness to match any word.

Implementation details. We give the specific recurrences we use to score documents in a single pass with this model. We define:

$$[\text{maxmul}(A, B)]_{i,j} = \max_k A_{i,k} B_{k,j}. \qquad (5)$$


We also define the following for taking zero or one ε-transitions:

$$\text{eps}(h) = \text{maxmul}(h, \max(I, T(\epsilon))) \qquad (6)$$

where max is element-wise max. We maintain a row vector $h_t$ at each token:⁸

$$h_0 = \text{eps}(\pi^\top), \qquad (7a)$$
$$h_{t+1} = \max(\text{eps}(\text{maxmul}(h_t, T(x_{t+1}))), h_0), \qquad (7b)$$

and then extract and aggregate END state values:

$$s_t = \text{maxmul}(h_t, \eta), \qquad (8a)$$
$$s_{\text{doc}} = \max_{1 \le t \le n} s_t. \qquad (8b)$$

$[h_t]_i$ represents the score of the best path through the pattern that ends in state $i$ after consuming $t$ tokens. By including $h_0$ in Equation 7b, we are accounting for spans that start at time $t + 1$. $s_t$ is the maximum of the exact match scores for all spans ending at token $t$. And $s_{\text{doc}}$ is the maximum score of any subspan in the document.
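The single-pass recurrences (Equations 5 through 8b) translate almost line by line into NumPy. The following is a minimal sketch for one pattern, assuming non-negative transition scores (as with the sigmoid encoder), a non-empty document, and the one-hot START and END vectors; the function names and the dense matrix representation are choices made for this illustration, whereas the text notes that an efficient implementation would exploit the three non-zero diagonals.

```python
import numpy as np

def maxmul(h, A):
    """Max-product of a row vector h (d,) with a matrix A (d, d), as in Equation 5."""
    return np.max(h[:, None] * A, axis=0)

def eps(h, T_eps):
    """Take zero or one epsilon-transition (Equation 6)."""
    return maxmul(h, np.maximum(np.eye(len(h)), T_eps))

def score_document(token_mats, T_eps):
    """Return s_doc (Equation 8b) for one pattern.

    token_mats : list of (d, d) matrices T(x_1), ..., T(x_n), n >= 1
    T_eps      : (d, d) epsilon-transition matrix
    """
    d = T_eps.shape[0]
    pi = np.zeros(d)
    pi[0] = 1.0                                        # START state
    h0 = eps(pi, T_eps)                                # Equation 7a
    h, scores = h0, []
    for T in token_mats:
        h = np.maximum(eps(maxmul(h, T), T_eps), h0)   # Equation 7b
        scores.append(h[-1])                           # Equation 8a with one-hot eta
    return float(max(scores))
```

Re-injecting `h0` at every step is what lets a match begin at any token, so a single left-to-right pass covers all spans instead of scoring each span separately.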


3.3 Aggregating Multiple Patterns 


We describe how $k$ patterns are aggregated to score a document. These $k$ patterns give $k$ different $s_{\text{doc}}$ scores for the document, which are stacked into a vector $z \in \mathbb{R}^k$ and constitute the final document representation of SoPa. This vector representation can be viewed as a feature vector. In this paper, we feed it into a multilayer perceptron (MLP), culminating in a softmax to give a probability distribution over document labels. We minimize cross-entropy, allowing the SoPa and MLP parameters to be learned end-to-end.

SoPa uses a total of $(2e + 3)dk$ parameters, where $e$ is the word embedding dimension, $d$ is the number of states and $k$ is the number of patterns. For comparison, an LSTM with a hidden dimension of $h$ has $4((e + 1)h + h^2)$ parameters. In Section 6 we show that SoPa consistently uses fewer parameters than a BiLSTM baseline to achieve its best result.
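The two parameter-count formulas are easy to evaluate side by side. The snippet below is only arithmetic on the expressions quoted above; the function names are hypothetical and the sizes plugged in (300-dimensional embeddings, 50 patterns of 5 states, an LSTM with 100 hidden units) are illustrative, not the configurations used in the experiments.

```python
def sopa_param_count(e, d, k):
    """(2e + 3) * d * k: per state, two e-dimensional vectors (u_i, w_i)
    plus three scalars (a_i, b_i, c_i)."""
    return (2 * e + 3) * d * k

def lstm_param_count(e, h):
    """4 * ((e + 1) * h + h^2): four gates, each with input weights,
    a bias, and recurrent weights."""
    return 4 * ((e + 1) * h + h * h)

# Illustrative sizes only (assumed for this example).
sopa_total = sopa_param_count(e=300, d=5, k=50)   # 603 * 250
lstm_total = lstm_param_count(e=300, h=100)       # 4 * (30100 + 10000)
```

The SoPa count grows linearly in both the number of states and the number of patterns, while the LSTM count has a quadratic term in the hidden size.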


⁸Here a row vector $h$ of size $n$ can also be viewed as a $1 \times n$ matrix.


3.4 SoPa as an RNN 


SoPa can be considered an RNN. As shown in Sec- 
tion 3.2, a single pattern with d states has a hidden 
state vector of size $d$. Stacking the $k$ hidden state vectors of $k$ patterns into one vector of size $k \times d$ can be thought of as the hidden state of our model. This hidden state is, like in any other RNN, dependent on the input and the previous state. Using self-loops, the hidden state at time point $i$ can in theory depend on the entire history of tokens up to $x_i$ (see Figure 2 for illustration). We do want to discourage the model from following too many self-loops,
only doing so if it results in a better fit with the 
remainder of the pattern. To do this we use the 
sigmoid function as our encoding function E (see 
Equation 3), which means that all transitions have 
scores strictly less than 1. This works to keep pat- 
tern matches close to their intended length. Using 
other encoders, such as the identity function, can 
result in different dynamics, potentially encourag- 
ing rather than discouraging self-loops. 
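A short worked expression may help show why the sigmoid encoder penalizes extra self-loops under the max-product semiring. The notation here ($m$, $\ell$, $\alpha_r$, $\beta_q$) is introduced only for this illustration and ignores ε-transitions: consider a single path that takes $m$ main-path transitions and $\ell$ self-loops, with pre-activation values $\alpha_r$ and $\beta_q$ respectively.

```latex
s_{\text{path}}
  \;=\; \prod_{r=1}^{m} \sigma(\alpha_r) \;\cdot\; \prod_{q=1}^{\ell} \sigma(\beta_q),
\qquad 0 < \sigma(\cdot) < 1 .
```

Every additional self-loop multiplies the path score by a factor strictly below 1, so a longer path can win the max only when the alignment it enables raises the main-path factors by more than the self-loop factors cost. With the identity encoder the factors are unbounded, which is consistent with the remark that other encoders may encourage self-loops.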

Although even single-layer RNNs are Turing 
complete (Siegelmann and Sontag, 1995), SoPa’s 
expressive power depends on the semiring. When 
a WFSA is thought of as a function from finite 
sequences of tokens to semiring values, it is re- 
stricted to the class of functions known as rational 
series (Schützenberger, 1961; Droste and Gastin, 1999; Sakarovitch, 2009).⁹ It is unclear how limiting this theoretical restriction is in practice, especially
when SoPa is used as a component in a
larger network. We defer the investigation of the 
exact computational properties of SoPa to future 
work. In the next section, we show that SoPa is 
an extension of a one-layer CNN, and hence more 
expressive. 


4 SoPa as a CNN Extension 


A convolutional neural network (CNN; LeCun, 
1998) moves a fixed-size sliding window over the 
document, producing a vector representation for 
each window. These representations are then of- 
ten summed, averaged, or max-pooled to produce 
a document-level representation (Kim, 2014; Yin 
and Schütze, 2015). In this section, we show
that SoPa is an extension of one-layer, max-pooled 
CNNs. 

⁹Rational series generalize recognizers of regular languages, which are the special case of the Boolean semiring.

Figure 2: State activations of two patterns as they score a document. pattern1 (length three) matches on “in years”. pattern2 (length five) matches on “funniest and most likeable book”, using a self-loop to consume the token “most”. Active states in the best match are marked with arrow cursors.

To recover a CNN from a soft pattern with $d + 1$ states, we first remove self-loops and ε-transitions, retaining only the main path transitions. We also use the identity function as our encoder $E$ (Equation 3), and use the max-sum semiring. With only main path transitions, the network will not match any span that is not exactly $d$ tokens long. Using max-sum, spans of length $d$ will be assigned the score:


$$s_{\text{span}}(x_{i:i+d}) = \sum_{j=0}^{d-1} \left( w_j \cdot v_{x_{i+j}} + b_j \right) \qquad (9a)$$
$$= w_{0:d} \cdot v_{x_{i:i+d}} + \sum_{j=0}^{d-1} b_j, \qquad (9b)$$

where $w_{0:d} = [w_0^\top; \ldots; w_{d-1}^\top]^\top$ and $v_{x_{i:i+d}} = [v_{x_i}^\top; \ldots; v_{x_{i+d-1}}^\top]^\top$. Rearranged this way, we recognize the span score as an affine transformation of the concatenated word vectors $v_{x_{i:i+d}}$. If we use $k$ patterns, then together their span scores correspond to a linear filter with window size $d$ and output dimension $k$.¹⁰ A single pattern’s score for a document is:

$$s_{\text{doc}}(x) = \max_{1 \le i \le n-d+1} s_{\text{span}}(x_{i:i+d}). \qquad (10)$$
The max in Equation 10 is calculated for each 
pattern independently, corresponding exactly to 
element-wise max-pooling of the CNN’s output 
layer. 
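The equivalence can be checked numerically on random inputs. The sketch below scores a document once with a max-sum, main-path-only pattern (a restricted form of the Section 3.2 recurrence) and once with an explicit sliding-window linear filter followed by max-pooling. The function names, shapes, and random data are assumptions for this illustration, and the document is assumed to contain at least $d$ tokens.

```python
import numpy as np

def sopa_maxsum_doc_score(V, W, b):
    """Max-sum pattern with d + 1 states and main-path transitions only.

    V : (n, e) word vectors of the document, n >= d
    W : (d, e) main-path vectors w_0, ..., w_{d-1}
    b : (d,)   main-path biases
    """
    n, d = V.shape[0], W.shape[0]
    h0 = np.full(d + 1, -np.inf)   # -inf is the max-sum semiring zero
    h0[0] = 0.0                    # 0 is the semiring one, placed on START
    h, best = h0.copy(), -np.inf
    for t in range(n):
        new = np.full(d + 1, -np.inf)
        new[1:] = h[:-1] + (W @ V[t] + b)   # state j -> j + 1, consuming token t
        h = np.maximum(new, h0)             # allow a match to start at any token
        best = max(best, h[-1])             # END-state value after token t
    return float(best)

def cnn_maxpool_score(V, W, b):
    """One linear filter of width d over concatenated word vectors, then max-pool."""
    n, d = V.shape[0], W.shape[0]
    filt = W.reshape(-1)
    windows = [V[i:i + d].reshape(-1) @ filt + b.sum() for i in range(n - d + 1)]
    return float(max(windows))

rng = np.random.default_rng(1)
V = rng.normal(size=(7, 4))   # 7 tokens, 4-dimensional embeddings (toy sizes)
W = rng.normal(size=(3, 4))   # window size d = 3
b = rng.normal(size=3)
same = np.isclose(sopa_maxsum_doc_score(V, W, b), cnn_maxpool_score(V, W, b))
```

The $d$ separate biases on the pattern side collapse into the single `b.sum()` term on the CNN side, which is the redundancy the accompanying footnote points out.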

Based on the equivalence between this impoverished version of SoPa and CNNs, we conclude that one-layer CNNs are learning an even more restricted class of WFSAs (linear-chain WFSAs) that capture only fixed-length patterns.

¹⁰This variant of SoPa has $d$ bias parameters, which correspond to only a single bias parameter in a CNN. The redundant biases may affect optimization but are an otherwise unimportant difference.

One notable difference between SoPa and arbitrary CNNs is that in general CNNs can use any filter (like an MLP over $v_{x_{i:i+d}}$, for example). In contrast, in order to efficiently pool over flexible-length spans, SoPa is restricted to operations that follow the semiring laws.¹¹

As a model that is more flexible than a one-layer 
CNN, but (arguably) less expressive than many 
RNNs, SoPa lies somewhere on the continuum be- 
tween these two approaches. Continuing to study 
the bridge between CNNs and RNNs is an exciting
direction for future research. 


5 Experiments 


To evaluate SoPa, we apply it to text classification 
tasks. Below we describe our datasets and base- 
lines. More details can be found in Appendix A. 


Datasets. We experiment with three binary clas- 
sification datasets. 


• SST. The Stanford Sentiment Treebank (Socher et al., 2013)¹² contains roughly 10K movie reviews from Rotten Tomatoes,¹³ labeled on a scale of 1–5. We consider the binary task, which considers 1 and 2 as negative, and 4 and 5 as positive (ignoring 3s). It is worth noting that this dataset also contains syntactic phrase level annotations, providing a sentiment label to parts of sentences. In order to experiment in a realistic setup, we only consider the complete sentences, and ignore syntactic annotations at train or test time. The number of training/development/test sentences in the dataset is 6,920/872/1,821.

¹¹The max-sum semiring corresponds to a linear filter with max-pooling. Other semirings could potentially model more interesting interactions, but we leave this to future work.

¹²https://nlp.stanford.edu/sentiment/index.html

¹³http://www.rottentomatoes.com


• Amazon. The Amazon Review Corpus (McAuley and Leskovec, 2013)¹⁴ contains electronics product reviews, a subset of a larger review dataset. Each document in the dataset contains a review and a summary. Following Yogatama et al. (2015), we only use the reviews part, focusing on positive and negative reviews. The number of training/development/test samples is 20K/5K/25K.


• ROC. The ROC story cloze task (Mostafazadeh
et al., 2016) is a story understanding task.¹⁵ The
task is composed of four-sentence story pre- 
fixes, followed by two competing endings: one 
that makes the joint five-sentence story coher- 
ent, and another that makes it incoherent. Fol- 
lowing Schwartz et al. (2017), we treat it as a 
style detection task: we treat all “right” endings 
as positive samples and all “wrong” ones as neg- 
ative, and we ignore the story prefix. We split 
the development set into train and development 
(of sizes 3,366 and 374 sentences, respectively), 
and take the test set as-is (3,742 sentences). 


Reduced training data. In order to test our 
model’s ability to learn from small datasets, we 
also randomly sample 100, 500, 1,000 and 2,500 
SST training instances and 100, 500, 1,000, 2,500, 
5,000, and 10,000 Amazon training instances. De- 
velopment and test sets remain the same. 


Baselines. We compare to four baselines: a BiL- 
STM, a one-layer CNN, DAN (a simple alterna- 
tive to RNNs) and a feature-based classifier trained 
with hard-pattern features. 


• BiLSTM. Bidirectional LSTMs have been successfully used in the past for text classification tasks (Zhou et al., 2016). We learn a one-layer BiLSTM representation of the document, and feed the average of all hidden states to an MLP.

• CNN. CNNs are particularly useful for text classification (Kim, 2014). We train a one-layer CNN with max-pooling, and feed the resulting representation to an MLP.

• DAN. We learn a deep averaging network with word dropout (Iyyer et al., 2015), a simple but strong text-classification baseline.

• Hard. We train a logistic regression classifier with hard-pattern features. Following Tsur et al. (2010), we replace low frequency words with a special wildcard symbol. We learn sequences of 1–6 concrete words, where any number of wildcards can come between two adjacent words. We consider words occurring with frequency of at least 0.01% of our training set as concrete words, and words occurring with frequency 1% or less as wildcards.¹⁶

¹⁴http://riejohnson.com/cnn_data.html

¹⁵http://cs.rochester.edu/nlp/rocstories/


Number of patterns. SoPa requires specifying 
the number of patterns to be learned, and their 
lengths. Preliminary experiments showed that the 
model doesn’t benefit from more than a few dozen 
patterns. We experiment with several configu- 
rations of patterns of different lengths, generally 
considering 0, 10 or 20 patterns of each pattern 
length between 2–7. The total number of patterns
learned ranges between 30–70.¹⁷


6 Results 


Table 1 shows our main experimental results. In 
two of the cases (SST and ROC), SoPa outper- 
forms all models. On Amazon, SoPa performs 
within 0.3 points of CNN and BiLSTM, and out- 
performs the other two baselines. The table also 
shows the number of parameters used by each 
model for each task. Given enough data, mod- 
els with more parameters should be expected to 
perform better. However, SoPa performs better than or
roughly the same as a BiLSTM, which has 3–6
times as many parameters.

Figure 3 shows a comparison of all models on 
the SST and Amazon datasets with varying training
set sizes. SoPa substantially outperforms
all baselines, in particular BiLSTM, on small
datasets (100 samples). This suggests that SoPa is
better suited to learning from small datasets.


Ablation analysis. Table 1 also shows an ablation of the differences between SoPa and CNN: max-product semiring with sigmoid vs. max-sum semiring with identity, self-loops, and ε-transitions. The last line is equivalent to a CNN with multiple window sizes. Interestingly, the most notable difference between SoPa and CNN is the semiring and encoder function, while ε-transitions and self-loops have little effect on performance.¹⁸

Model                 ROC           SST           Amazon
Hard                  62.2 (4K)     75.5 (6K)     88.5 (67K)
DAN                   64.3 (91K)    83.1 (91K)    85.4 (91K)
BiLSTM                65.2 (844K)   84.8 (1.5M)   90.8 (844K)
CNN                   64.3 (155K)   82.2 (62K)    90.2 (305K)
SoPa                  66.5 (255K)   85.6 (255K)   90.5 (256K)
SoPa_ms1              64.4          84.8          90.0
SoPa_ms1 \ {sl}       63.2          84.6          89.8
SoPa_ms1 \ {ε}        64.3          83.6          89.7
SoPa_ms1 \ {sl, ε}    64.0          85.0          89.5

Table 1: Test classification accuracy (and the number of parameters used). The bottom part shows our ablation results: SoPa: our full model. SoPa_ms1: running with max-sum semiring (rather than max-product), with the identity function as our encoder E (see Equation 3). sl: self-loops, ε: ε-transitions. The final row is equivalent to a one-layer CNN.

Figure 3: Test accuracy on SST and Amazon with varying number of training instances. [Two panels plotting classification accuracy against the number of training samples (SST, Amazon) for DAN, Hard, BiLSTM, CNN, and SoPa (ours).]

¹⁶Some words may serve as both words and wildcards. See Davidov and Rappoport (2008) for discussion.

¹⁷The number of patterns and their length are hyperparameters tuned on the development data (see Appendix A).


7 Interpretability 


We turn to another key aspect of SoPa—its inter- 
pretability. We start by demonstrating how we in- 
terpret a single pattern, and then describe how to 
interpret the decisions made by downstream clas- 
sifiers that rely on SoPa—in this case, a sentence 
classifier. Importantly, these visualization tech- 
niques are equally applicable to CNNs. 


Interpreting a single pattern. In order to visualize a pattern, we compute the pattern matching scores with each phrase in our training dataset, and select the k phrases with the highest scores. Table 2 shows examples of six patterns learned using the best SoPa model on the SST dataset, as represented by their five highest scoring phrases in the training set. A few interesting trends can be observed from these examples. First, it seems our patterns encode semantically coherent expressions. A large portion of them correspond to sentiment (the five top examples in the table), but others capture different semantics, e.g., time expressions.

          Highest Scoring Phrases
Patt. 1   thoughtful reverent portrait of
          and astonishingly articulate cast of
          entertaining thought-provoking film with
          gentle , mesmerizing portrait of
          poignant and uplifting story in
Patt. 2   ’s ε uninspired story :
          this ε bad on purpose
          this ε leaden comedy
          a ε half-assed film
          is ε clumsy ,_SL the writing
Patt. 3   mesmerizing portrait of a
          engrossing portrait of a
          clear-eyed portrait of an
          fascinating portrait of a
          self-assured portrait of small
Patt. 4   honest , and enjoyable
          soulful , scathing_SL and joyous
          unpretentious , charming_SL , quirky
          forceful , and beautifully
          energetic , and surprisingly
Patt. 5   is deadly dull
          a numbingly dull
          is remarkably dull
          is a phlegmatic
          an utterly incompetent
Patt. 6   five minutes
          four minutes
          final minutes
          first half-hour
          fifteen minutes

Table 2: Six patterns of different lengths learned by SoPa on SST. Each group represents a single pattern p, and shows the five phrases in the training data that have the highest score for p. Columns represent pattern states. Words marked with SL are self-loops. ε symbols indicate ε-transitions. All other words are from main path transitions.

¹⁸Although SoPa does make use of them; see Section 7.

Second, it seems our patterns are relatively soft, 
and allow lexical flexibility. While some patterns 
do seem to fix specific words, e.g., “of” in the first 
example or “minutes” in the last one, even in those 
cases some of the top matching spans replace these 
words with other, similar words (“with” and “half- 
hour”, respectively). Encouraging SoPa to have 
more concrete words, e.g., by jointly learning the 
word vectors, might make SoPa useful in other 
contexts, particularly as a decoder. We defer this 
direction to future work. 

Finally, SoPa makes limited but non-negligible use of self-loops and epsilon steps. Interestingly, the second example shows that one of the patterns had an ε-transition at the same place in every phrase. This demonstrates a different function of ε-transitions than originally designed: they allow a pattern to effectively shorten itself, by learning a high ε-transition parameter for a certain state.

Analyzed Documents

it ’s dumb , but more importantly , it ’s just not scary

though moonlight mile is replete with acclaimed actors and actresses and tackles a subject that ’s potentially moving , the movie is too predictable and too self-conscious to reach a level of high drama

While its careful pace and seemingly opaque story may not satisfy every moviegoer ’s appetite, the film ’s final scene is soaringly , transparently moving

unlike the speedy wham-bam effect of most hollywood offerings , character development — and more importantly, character empathy — is at the heart of italian for beginners .

the band ’s courage in the face of official repression is inspiring , especially for aging hippies ( this one included ) .

Table 3: Documents from the SST training data. Phrases with the largest contribution toward a positive sentiment classification are in bold green, and the most negative phrases are in italic orange.


Interpreting a document. SoPa provides an in- 
terpretable representation of a document—a vec- 
tor of the maximal matching score of each pat- 
tern with any span in the document. To visual- 
ize the decisions of our model for a given docu- 
ment, we can observe the patterns and correspond- 
ing phrases that score highly within it. 

To understand which of the k patterns con- 
tributes most to the classification decision, we ap- 
ply a leave-one-out method. We run the forward 
method of the MLP layer in SoPa k times, each 
time zeroing-out the score of a different pattern 
p. The difference between the resulting score and 
the original model score is considered p’s contri- 
bution. We then consider the highest contributing 
patterns, and attach each one with its highest scor- 
ing phrase in that document. Table 3 shows exam- 
ple texts along with their most positive and nega- 
tive contributing phrases. 
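The leave-one-out procedure can be sketched as follows. This is a minimal illustration, not the released code: `classify` stands in for the forward pass of the MLP on the pattern-score vector, the linear scorer used in the example is a hypothetical placeholder, and the sign convention (original score minus ablated score, so that a positive value means the pattern pushed the score up) is an assumption made here.

```python
import numpy as np

def leave_one_out_contributions(z, classify):
    """Contribution of each pattern to a document's classification score.

    z        : (k,) vector of per-pattern s_doc scores for one document
    classify : callable mapping a (k,) vector to a scalar score
               (stand-in for the MLP forward pass)
    """
    base = classify(z)
    contributions = np.empty(len(z))
    for p in range(len(z)):
        ablated = z.copy()
        ablated[p] = 0.0                         # zero out pattern p's score
        contributions[p] = base - classify(ablated)
    return contributions

# Hypothetical stand-in for the MLP: a fixed linear scorer over k = 3 patterns.
weights = np.array([2.0, -1.0, 0.5])
z = np.array([0.5, 1.0, 0.0])
contrib = leave_one_out_contributions(z, lambda v: float(weights @ v))
top_positive = int(np.argmax(contrib))   # pattern pushing hardest toward the class
top_negative = int(np.argmin(contrib))   # pattern pushing hardest against it
```

Each of the highest-contributing patterns would then be paired with its best-matching span in the document, which requires keeping back-pointers (or re-running Viterbi) during scoring and is omitted here.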


8 Related Work 


Weighted finite-state automata. WFSAs and hidden Markov models¹⁹ were once popular in automatic speech recognition (Hetherington, 2004; Moore et al., 2006; Hoffmeister et al., 2012) and remain popular in morphology (Dreyer, 2011; Cotterell et al., 2015). Most closely related to this work, neural networks have been combined with weighted finite-state transducers to do morphological reinflection (Rastogi et al., 2016). These prior works learn a single FSA or FST, whereas our model learns a collection of simple but complementary FSAs, together encoding a sequence. We are the first to incorporate neural networks both before WFSAs (in their transition scoring functions), and after (in the function that turns their vector of scores into a final prediction), to produce an expressive model that remains interpretable.

¹⁹HMMs are a special case of WFSAs (Mohri et al., 2002).


Recurrent neural networks. The ability of 
RNNs to represent arbitrarily long sequences of 
embedded tokens has made them attractive to 
NLP researchers. The most notable variants, 
the long short-term memory (LSTM; Hochreiter 
and Schmidhuber, 1997) and gated recurrent units 
(GRU; Cho et al., 2014), have become ubiqui- 
tous in NLP algorithms (Goldberg, 2016). Re- 
cently, several works introduced simpler versions 
of RNNs, such as recurrent additive networks (Lee 
et al., 2017) and Quasi-RNNs (Bradbury et al., 
2017). Like SoPa, these models can be seen as 
points along the bridge between RNNs and CNNs. 

Other works have studied the expressive power 
of RNNs, in particular in the context of WFSAs 
or HMMs (Cleeremans et al., 1989; Giles et al., 
1992; Visser et al., 2001; Chen et al., 2018). In 
this work we relate CNNs to WFSAs, showing that 
a one-layer CNN with max-pooling can be simu- 
lated by a collection of linear-chain WFSAs. 
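
The relation between a max-pooled CNN filter and a linear-chain WFSA can be made concrete with a small sketch. The code below is an illustration under simplifying assumptions, not the paper's formal construction: it omits the nonlinearity, uses the max-sum semiring, scores each transition as a dot product between one slice of the filter and the token embedding, and adds the bias once at the end. All function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def cnn_max_pool(filter_slices: List[Vector], bias: float,
                 tokens: List[Vector]) -> float:
    """One CNN filter of window size d, followed by max-pooling.

    Assumes len(tokens) >= len(filter_slices).
    """
    d = len(filter_slices)
    responses = [
        bias + sum(dot(filter_slices[j], tokens[i + j]) for j in range(d))
        for i in range(len(tokens) - d + 1)
    ]
    return max(responses)


def linear_chain_wfsa(filter_slices: List[Vector], bias: float,
                      tokens: List[Vector]) -> float:
    """Best-path (max-sum) score of a (d+1)-state linear-chain WFSA.

    The only transitions are j -> j+1, each consuming one token; a match
    may start at any token and must end in state d.
    """
    d = len(filter_slices)
    neg_inf = float("-inf")
    state = [neg_inf] * (d + 1)  # state[j]: best score after j transitions
    best = neg_inf
    for x in tokens:
        new_state = [neg_inf] * (d + 1)
        for j in range(d):
            # State 0 is always reachable with score 0 (start anywhere).
            source = 0.0 if j == 0 else state[j]
            if source != neg_inf:
                new_state[j + 1] = source + dot(filter_slices[j], x)
        state = new_state
        best = max(best, state[d])
    return best + bias


# Made-up filter (window size 2) and three 2-dimensional token embeddings.
example_filter = [[1.0, 0.0], [0.0, 1.0]]
example_tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 5.0]]
cnn_score = cnn_max_pool(example_filter, 0.5, example_tokens)
wfsa_score = linear_chain_wfsa(example_filter, 0.5, example_tokens)
```

Because the automaton has no self-loops or ε-transitions, every accepted span has exactly d tokens, so its best path coincides with the best window of the filter.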


Convolutional neural networks. CNNs are 
prominent feature extractors in NLP, both for gen- 
erating character-based embeddings (Kim et al., 
2016), and as sentence encoders for tasks like 
text classification (Yin and Schütze, 2015) and
machine translation (Gehring et al., 2017). Sim- 
ilarly to SoPa, several recently introduced vari- 
ants of CNNs support varying window sizes by ei- 
ther allowing several fixed window sizes (Yin and 
Schütze, 2015) or by supporting non-consecutive
n-gram matching (Lei et al., 2015; Nguyen and 
Grishman, 2016). 


Neural networks and patterns. Some works 
used patterns as part of a neural network. 
Schwartz et al. (2016) used pattern contexts for 
estimating word embeddings, showing improved 
word similarity results compared to bag-of-word 


contexts. Shwartz et al. (2016) designed an 
LSTM representation for dependency patterns, us- 
ing them to detect hypernymy relations. Here, we 
learn patterns as a neural version of WFSAs. 


Interpretability. There have been several ef- 
forts to interpret neural models. The weights of the 
attention mechanism (Bahdanau et al., 2015) are 
often used to display the words that are most sig- 
nificant for making a prediction. LIME (Ribeiro 
et al., 2016) is another approach for visualizing 
neural models (not necessarily textual). Yogatama 
and Smith (2014) introduced structured sparsity, 
which encodes linguistic information into the
regularization of a model, thus making it possible to visualize
the contribution of different bag-of-word features.

Other works jointly learned to encode text and 
extract the span which best explains the model’s 
prediction (Yessenalina et al., 2010; Lei et al., 
2016). Li et al. (2016) and Kádár et al. (2017) sug- 
gested a method that erases pieces of the text in or- 
der to analyze their effect on a neural model’s de- 
cisions. Finally, several works presented methods 
to visualize deep CNNs (Zeiler and Fergus, 2014; 
Simonyan et al., 2014; Yosinski et al., 2015), fo- 
cusing on visualizing the different layers of the 
network, mainly in the context of image and video 
understanding. We believe these two types of 
research approaches are complementary: invent- 
ing general purpose visualization tools for exist- 
ing black-box models on the one hand, and on the 
other, designing models like SoPa that are inter- 
pretable by construction. 


9 Conclusion 


We introduced SoPa, a novel model that combines 
neural representation learning with WFSAs. We 
showed that SoPa is an extension of a one-layer 
CNN. It naturally models flexible-length spans 
with insertion and deletion, and it can be easily 
customized by swapping in different semirings. 
SoPa performs on par with or strictly better than 
four baselines on three text classification tasks, 
while requiring fewer parameters than the stronger 
baselines. On smaller training sets, SoPa outperforms
all four baselines. SoPa is a simple version of
an RNN, yet it is more expressive than one-layer
CNNs; we hope that it will encourage future
research on the bridge between these two mechanisms.
To facilitate such research, we release our
implementation at https://github.com/Noahs-ARK/soft_patterns.
Appendices 


A Experimental Setup 


We implemented all neural models in PyTorch,²⁰
and the Hard baseline in scikit-learn (Pedregosa
et al., 2011).²¹ We train using Adam (Kingma
and Ba, 2015) with a batch size of 150. We use 
300-dimensional GloVe 840B embeddings (Pen- 
nington et al., 2014) normalized to unit length. 
We randomly initialize all other parameters. Our 
MLP has two layers. For regularization, we use 
dropout.²²

In all cases, we tune the hyperparameters of our 
model on the development set by running 30 iter- 
ations of random search. The full list of hyper- 
parameters explored for each model can be found 
in Table 4. Finally, we train all models for 250 
epochs, stopping early if development loss does 
not improve for 30 epochs. 
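
The tuning and stopping procedure can be sketched as follows. This is a minimal illustration rather than the released training code: the function names are hypothetical, the search space lists only the SoPa rows of Table 4, and selecting the configuration with the lowest development loss is an assumption (the selection criterion is not stated above).

```python
import random
from typing import Any, Callable, Dict, List, Tuple

# SoPa search space copied from Table 4 (pattern length -> number of patterns).
SOPA_SPACE: Dict[str, List[Any]] = {
    "patterns": [
        {5: 10, 4: 10, 3: 10, 2: 10},
        {6: 10, 5: 10, 4: 10},
        {6: 10, 5: 10, 4: 10, 3: 10, 2: 10},
        {6: 20, 5: 20, 4: 10, 3: 10, 2: 10},
        {7: 10, 6: 10, 5: 10, 4: 10, 3: 10, 2: 10},
    ],
    "learning_rate": [0.01, 0.05, 0.001, 0.005],
    "dropout": [0, 0.05, 0.1, 0.2],
    "mlp_hidden_dim": [10, 25, 50, 100, 300],
}


def sample_config(space: Dict[str, List[Any]],
                  rng: random.Random) -> Dict[str, Any]:
    """Draw one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}


def train_with_early_stopping(
    dev_loss_at_epoch: Callable[[int], float],
    max_epochs: int = 250,
    patience: int = 30,
) -> Tuple[float, int]:
    """Return (best dev loss, epochs run).

    `dev_loss_at_epoch` is a hypothetical callback that trains for one
    epoch and returns the resulting development loss.
    """
    best_loss, best_epoch, epochs_run = float("inf"), -1, 0
    for epoch in range(max_epochs):
        loss = dev_loss_at_epoch(epoch)
        epochs_run = epoch + 1
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_loss, epochs_run


def random_search(
    space: Dict[str, List[Any]],
    train_and_eval: Callable[[Dict[str, Any]], float],
    n_iter: int = 30,
    seed: int = 0,
) -> Tuple[Dict[str, Any], float]:
    """Evaluate `n_iter` random configurations; keep the lowest dev loss."""
    rng = random.Random(seed)
    best_config: Dict[str, Any] = {}
    best_loss = float("inf")
    for _ in range(n_iter):
        config = sample_config(space, rng)
        loss = train_and_eval(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss
```

In practice `train_and_eval` would build a model from the sampled configuration and call `train_with_early_stopping` on it; the other models would use their own rows of Table 4 as the search space.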


²⁰https://pytorch.org/

²¹http://scikit-learn.org/

²²DAN uses word dropout instead of regular dropout as its
only learnable parameters are the MLP layer weights.


Type                Values                              Models

Patterns            {5:10,4:10,3:10,2:10},              SoPa
                    {6:10,5:10,4:10},
                    {6:10,5:10,4:10,3:10,2:10},
                    {6:20,5:20,4:10,3:10,2:10},
                    {7:10,6:10,5:10,4:10,3:10,2:10}
Learning rate       0.01, 0.05, 0.001, 0.005            SoPa, DAN, BiLSTM, CNN
Dropout             0, 0.05, 0.1, 0.2                   SoPa, BiLSTM, CNN
MLP hid. dim.       10, 25, 50, 100, 300                SoPa, DAN, BiLSTM, CNN
Hid. layer dim.     100, 200, 300                       BiLSTM
Out. layer dim.     50, 100, 200                        CNN
Window size         4, 5, 6                             CNN
Word dropout        0.1, 0.2, 0.3, 0.4                  DAN
Log. reg. param     1, 0.5, 0.1, 0.05, 0.01,            Hard
                    0.005, 0.001
Min. pattern freq.  2-10, 0.1%                          Hard



Table 4: The hyperparameters explored in our ex- 
periments. Patterns: the number of patterns of 
each length. For example, {5:20,4:10} means 20 
patterns of length 5 and 10 patterns of length 4. 
MLP hid. dim.: the dimension of the hidden layer 
of the MLP. Hid. layer dim.: the BiLSTM hid- 
den layer dimension. Out. layer dim.: the CNN 
output layer dimension. Window size: the CNN 
window size. Log. reg. param: the logistic re- 
gression regularization parameter. Min. pattern 
freq.: minimum frequency for a pattern to be in- 
cluded as a logistic regression feature, expressed 
either as absolute count or as relative frequency in 
the train set. Models: the models to which each 
hyperparameter applies (see Section 5). 


